ACCELERATING NESTEROV'S METHOD FOR STRONGLY
CONVEX FUNCTIONS WITH LIPSCHITZ GRADIENT
Abstract. We modify Nesterov's constant step gradient method for strongly convex functions
with Lipschitz continuous gradient described in Nesterov's book. Nesterov shows that
$f(x_k) - f^* \le L \prod_{i=1}^{k} (1 - \alpha_i) \|x_0 - x^*\|_2^2$ with $\alpha_k = \sqrt{\rho}$ for all $k$,
where $L$ is the Lipschitz gradient constant and $\rho$ is the reciprocal condition number of $f(x)$.
Hence the convergence rate is $1 - \sqrt{\rho}$. In this work,
we try to accelerate Nesterov's method by adaptively searching for an $\alpha_k > \sqrt{\rho}$ at each iteration.
The proposed method evaluates the gradient function at most twice per iteration and has some extra 
Level 1 BLAS operations. Theoretically, in the worst case, it takes the same number of iterations as 
Nesterov's method does but doubles the gradient calls. However, in practice, the proposed method 
effectively accelerates the speed of convergence for many problems including a smoothed basis pursuit 
denoising problem. 

Key words. first-order method, gradient method, Nesterov's optimal method, strongly convex
function, strong convexity, Lipschitz continuous gradient, basis pursuit denoising, BPDN

AMS subject classifications. 90C25, 90C06, 65F10. 

1. Introduction. First-order methods for convex optimization have drawn great
interest in recent years as problem sizes keep growing. Higher-order methods are a
poor fit at this scale because they generally need more memory than first-order
methods and many more operations per iteration. However, the slow convergence
of first-order methods can limit their practical use. For example,
the constant step gradient descent method converges at the rate $O(1/k)$ for functions
with Lipschitz gradient (with constant $L$), where $k$ is the number of iterations. This
means that we need about one million iterations to reach $f(x_k) - f^* < O(10^{-6})(f(x_0) - f^*)$.
Nesterov [5] advanced the field with a first-order method converging at the rate
$O(1/k^2)$. We refer to this method as $\mathcal{N}_L$. To reach the same precision as in the
previous example, $\mathcal{N}_L$ needs only about one thousand iterations. Nesterov not only shows
that the method is faster than the gradient descent method but also shows that it is
optimal among all first-order methods on functions with Lipschitz gradient. To seek a
first-order method with faster convergence, we have to restrict the functions of
interest. Nesterov [5] considered functions with both Lipschitz gradient and strong
convexity (with parameter $\mu$), and he constructed another first-order method with
linear convergence rate, referred to as $\mathcal{N}_{\mu,L}$. The gradient descent method can also
achieve linear convergence on those functions. Nevertheless, to reach a given precision,
the number of iterations the gradient descent method needs is $O(\kappa)$, where $\kappa$ is
the condition number of the objective function, while the number of iterations $\mathcal{N}_{\mu,L}$
needs is only $O(\sqrt{\kappa})$, which is proved to be optimal too.
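To give a feel for the gap between $O(\kappa)$ and $O(\sqrt{\kappa})$, the following minimal sketch counts the iterations needed to shrink an error by a factor $\epsilon$ under a fixed per-iteration contraction. It assumes, purely for illustration, that gradient descent contracts the error by $1 - \rho$ per step and that $\mathcal{N}_{\mu,L}$ contracts it by $1 - \sqrt{\rho}$; the values of `kappa` and `eps` and the helper name `iterations_needed` are hypothetical.

```python
import math

def iterations_needed(rate: float, eps: float) -> int:
    """Smallest k with rate**k <= eps, for a per-iteration contraction factor `rate`."""
    return math.ceil(math.log(eps) / math.log(rate))

kappa = 1.0e4          # hypothetical condition number L / mu
rho = 1.0 / kappa      # reciprocal condition number
eps = 1.0e-6           # hypothetical target reduction factor

# Idealized linear rates, assumed here only for illustration.
k_gd = iterations_needed(1.0 - rho, eps)               # grows like kappa
k_nes = iterations_needed(1.0 - math.sqrt(rho), eps)   # grows like sqrt(kappa)
print(k_gd, k_nes)
```

With these assumed rates the two counts differ by roughly a factor of $\sqrt{\kappa}$, which is the gap described above.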

In this work, we are interested in accelerating $\mathcal{N}_{\mu,L}$ in a practical way. In section 2
we briefly review how Nesterov constructs $\mathcal{N}_{\mu,L}$. Then we present our modification
to Nesterov's method in section 3. Related work on improving Nesterov's methods is
discussed in section 4, and section 5 reports numerical results.

2. Nesterov's method. We briefly review Nesterov's constant step gradient
method for strongly convex functions with Lipschitz gradient, referred to as $\mathcal{N}_{\mu,L}$,
and its convergence properties. The content is mostly taken from Nesterov [6] with
some simplifications. We keep this section short and concise but detail how Nesterov
constructs the method because our modification is based on it. We begin with the
definition of $\mathcal{S}_{\mu,L}$, the class of strongly convex functions with Lipschitz gradient, and
an assumption on first-order methods.

Definition 2.1. A continuously differentiable function $f(x)$ is in $\mathcal{S}_{\mu,L}(\Omega)$ for some
$L \ge \mu > 0$ if for any $x, y \in \Omega$ we have both of the following:

$$\|f'(x) - f'(y)\|_2 \le L \|x - y\|_2, \tag{2.1}$$

$$f(y) \ge f(x) + \langle f'(x), y - x \rangle + \frac{\mu}{2} \|y - x\|_2^2. \tag{2.2}$$

The value $\kappa = L/\mu$ is called the condition number of $f(x)$ and $\rho = 1/\kappa$ is called the
reciprocal condition number of $f(x)$.
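As a concrete instance of Definition 2.1, the sketch below takes a diagonal quadratic $f(x) = \frac{1}{2}\sum_i d_i x_i^2$, for which $\mu$ and $L$ are the smallest and largest $d_i$, and checks (2.1) and (2.2) at random pairs of points. The diagonal `d` and all helper names are hypothetical choices made for illustration.

```python
import random

d = [1.0, 4.0, 25.0]       # hypothetical Hessian diagonal of f(x) = 0.5 * sum(d_i * x_i^2)
mu, L = min(d), max(d)     # strong convexity and Lipschitz constants for this f
kappa, rho = L / mu, mu / L

def f(x):
    return 0.5 * sum(di * xi * xi for di, xi in zip(d, x))

def grad(x):
    return [di * xi for di, xi in zip(d, x)]

def norm(v):
    return sum(vi * vi for vi in v) ** 0.5

random.seed(0)
ok = True
for _ in range(1000):
    x = [random.uniform(-1, 1) for _ in d]
    y = [random.uniform(-1, 1) for _ in d]
    diff = [yi - xi for xi, yi in zip(x, y)]
    gdiff = [gy - gx for gx, gy in zip(grad(x), grad(y))]
    lipschitz = norm(gdiff) <= L * norm(diff) + 1e-12                    # inequality (2.1)
    inner = sum(gi * di_ for gi, di_ in zip(grad(x), diff))
    strong = f(y) >= f(x) + inner + 0.5 * mu * norm(diff) ** 2 - 1e-12   # inequality (2.2)
    ok = ok and lipschitz and strong
print(kappa, rho, ok)
```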

Throughout, we assume that for a function from $\mathcal{S}_{\mu,L}$ either $\mu$ and $L$ or a lower
bound of $\mu$ and an upper bound of $L$ are given.

Assumption 2.1. [6, p. 59] A first-order method generates a sequence of points
$\{x_k\}$ such that $x_k \in x_0 + \mathrm{Span}\{f'(x_0), \ldots, f'(x_{k-1})\}$, $k \ge 1$.

For functions in $\mathcal{S}_{\mu,L}$, Nesterov constructs a first-order method, $\mathcal{N}_{\mu,L}$, and shows
that it matches a lower complexity bound for first-order methods satisfying
Assumption 2.1 up to a constant factor in the sense of worst-case number of iterations.
Nesterov [6] gives more details on the optimality. Note that Assumption 2.1 is very mild,
as most first-order methods fall into the framework, which secures the optimality of
$\mathcal{N}_{\mu,L}$. To construct such an optimal first-order method, Nesterov introduces an
estimate sequence and shows how it helps derive $\mathcal{N}_{\mu,L}$ and prove its convergence rate.

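Definition 2.2. [6, p. 72] A pair of sequences $\{\phi_k(x)\}$ and $\{\lambda_k\}$, $\lambda_k \ge 0$, is called
an estimate sequence of $f(x)$ if $\lambda_k \to 0$ and we have

$$\phi_k(x) \le (1 - \lambda_k) f(x) + \lambda_k \phi_0(x), \quad \forall x \in \mathbb{R}^n \text{ and } k \ge 0. \tag{2.3}$$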



Lemma 2.3. [6, p. 72] If the pair of sequences $\{\phi_k(x)\}$ and $\{\lambda_k\}$ is an estimate
sequence of $f(x)$ and for some sequence $\{x_k\}$ we have

$$f(x_k) \le \phi_k^* \equiv \min_x \phi_k(x), \tag{2.4}$$

then $f(x_k) - f^* \le \lambda_k [\phi_0(x^*) - f^*] \to 0$, where $x^*$ is the minimizer of $f(x)$ and $f^* = f(x^*)$ is the optimal value.

Now the question becomes, given $f(x) \in \mathcal{S}_{\mu,L}$, how can we construct an estimate
sequence of $f(x)$ and generate a sequence $\{x_k\}$ satisfying (2.4)? To construct an
estimate sequence, we have the following lemma.

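Lemma 2.4. [6, p. 72] Assume the following:

1. $f \in \mathcal{S}_{\mu,L}(\mathbb{R}^n)$,

2. $\phi_0(x)$ is an arbitrary function on $\mathbb{R}^n$,

3. $\{y_k\}$ is an arbitrary sequence in $\mathbb{R}^n$,

4. $\{\alpha_k\}$: $\alpha_k \in (0, 1)$, $\sum_{k=0}^{\infty} \alpha_k = \infty$,

5. $\lambda_0 = 1$.

Then the pair of sequences $\{\phi_k(x)\}$, $\{\lambda_k\}$ recursively defined by

$$\lambda_{k+1} = (1 - \alpha_k) \lambda_k, \tag{2.5}$$

$$\phi_{k+1}(x) = (1 - \alpha_k) \phi_k(x) + \alpha_k \left[ f(y_k) + \langle f'(y_k), x - y_k \rangle + \frac{\mu}{2} \|x - y_k\|_2^2 \right], \tag{2.6}$$

is an estimate sequence.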
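As a quick worked instance of the recursion (2.5), unrolling it from $\lambda_0 = 1$ gives a closed form for $\lambda_k$. The constant choice $\alpha_i = \alpha$ below is only an illustration of the case used later in this section:

$$\lambda_k = \prod_{i=0}^{k-1} (1 - \alpha_i), \qquad \text{and for } \alpha_i = \alpha \in (0, 1): \quad \lambda_k = (1 - \alpha)^k \to 0, \quad \sum_{k=0}^{\infty} \alpha_k = \infty.$$

So a constant sequence satisfies condition 4, and the larger $\alpha$ is, the faster $\lambda_k$ decays.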

We see that Lemma 2.4 leaves us freedom in the choice of $\phi_0(x)$, $\{y_k\}$, and $\{\alpha_k\}$.
To combine the result from Lemma 2.3 we should choose a simple $\phi_0(x)$ such that $\phi_k^*$
is easy to obtain in explicit form, and choose $\{y_k\}$ and $\{\alpha_k\}$ appropriately such that
we can find $x_k$ satisfying $f(x_k) \le \phi_k^*$ for each $k$. The following lemma is a simplified
version of Lemma 2.2.3 of Nesterov [6, p. 69].



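Lemma 2.5. Let $\phi_0(x) = \phi_0^* + \frac{\mu}{2} \|x - v_0\|_2^2$. Then the process defined in Lemma
2.4 preserves the canonical form of functions $\{\phi_k(x)\}$:

$$\phi_k(x) = \phi_k^* + \frac{\mu}{2} \|x - v_k\|_2^2, \tag{2.7}$$

where the sequences $\{v_k\}$ and $\{\phi_k^*\}$ are defined as follows:

$$v_{k+1} = (1 - \alpha_k) v_k + \alpha_k y_k - \frac{\alpha_k}{\mu} f'(y_k), \tag{2.8}$$

$$\phi_{k+1}^* = (1 - \alpha_k) \phi_k^* + \alpha_k f(y_k) - \frac{\alpha_k^2}{2\mu} \|f'(y_k)\|_2^2 + \alpha_k (1 - \alpha_k) \left( \frac{\mu}{2} \|y_k - v_k\|_2^2 + \langle f'(y_k), v_k - y_k \rangle \right). \tag{2.9}$$

Suppose we have $\phi_k^* \ge f(x_k)$ at the $k$-th iteration. By (2.2) we know

$$\phi_k^* \ge f(x_k) \ge f(y_k) + \langle f'(y_k), x_k - y_k \rangle + \frac{\mu}{2} \|x_k - y_k\|_2^2.$$

Plugging it into (2.9), we get

$$\phi_{k+1}^* \ge f(y_k) - \frac{\alpha_k^2}{2\mu} \|f'(y_k)\|_2^2 + (1 - \alpha_k) \langle f'(y_k), \alpha_k (v_k - y_k) + (x_k - y_k) \rangle + \frac{\mu (1 - \alpha_k)}{2} \left( \alpha_k \|v_k - y_k\|_2^2 + \|x_k - y_k\|_2^2 \right). \tag{2.10}$$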

Remember that $y_k$ is arbitrary. We can choose $y_k = (x_k + \alpha_k v_k)/(1 + \alpha_k)$ to eliminate
the linear term associated with $f'(y_k)$ and drop the sum of squares. Then we have

$$\phi_{k+1}^* \ge f(y_k) - \frac{\alpha_k^2}{2\mu} \|f'(y_k)\|_2^2.$$

Therefore, to make $\phi_{k+1}^* \ge f(x_{k+1})$, it is sufficient to find an $x_{k+1}$ such that

$$f(x_{k+1}) \le f(y_k) - \frac{\alpha_k^2}{2\mu} \|f'(y_k)\|_2^2.$$

Because $f'(x)$ is Lipschitz continuous with constant $L$, by choosing $x_{k+1} = y_k - \frac{1}{L} f'(y_k)$ we can always ensure

$$f(x_{k+1}) \le f(y_k) - \frac{1}{2L} \|f'(y_k)\|_2^2. \tag{2.11}$$

Comparing the two inequalities above, we see setting $\alpha_k = \sqrt{\mu/L} = \sqrt{\rho}$ would suffice.
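The comparison can be spelled out in one line. The bound (2.11) implies the required inequality whenever its right-hand side is no larger than the right-hand side of the target, which gives the largest admissible constant value of $\alpha_k$:

$$f(y_k) - \frac{1}{2L} \|f'(y_k)\|_2^2 \le f(y_k) - \frac{\alpha_k^2}{2\mu} \|f'(y_k)\|_2^2 \quad \Longleftarrow \quad \frac{\alpha_k^2}{\mu} \le \frac{1}{L} \quad \Longleftrightarrow \quad \alpha_k \le \sqrt{\mu / L} = \sqrt{\rho}.$$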
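Now we can further simplify the update scheme by eliminating $\{v_k\}$. With $\alpha = \sqrt{\rho}$, so that
$\alpha^2/\mu = 1/L$, and $v_k = ((1 + \alpha) y_k - x_k)/\alpha$ from the choice of $y_k$, we have

$$\begin{aligned}
y_{k+1} &= \frac{x_{k+1} + \alpha v_{k+1}}{1 + \alpha} \\
&= \frac{x_{k+1} + \alpha \left[ (1 - \alpha) v_k + \alpha y_k - \frac{\alpha}{\mu} f'(y_k) \right]}{1 + \alpha} \\
&= \frac{x_{k+1} + (1 - \alpha) \left[ (1 + \alpha) y_k - x_k \right] + \alpha^2 y_k - \frac{1}{L} f'(y_k)}{1 + \alpha} \\
&= \frac{x_{k+1} + \left[ x_{k+1} - (1 - \alpha) x_k \right]}{1 + \alpha} \\
&= x_{k+1} + \frac{1 - \alpha}{1 + \alpha} (x_{k+1} - x_k).
\end{aligned}$$

We summarize this method in Algorithm 1, which is extremely simple. The term
$\frac{1 - \sqrt{\rho}}{1 + \sqrt{\rho}}$ is called the acceleration parameter.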



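Algorithm 1 $\mathcal{N}_{\mu,L}$, Nesterov's constant step scheme, III [6, p. 81]

1: Given $f(x) \in \mathcal{S}_{\mu,L}$ and $x_0$, set $\rho = \mu/L$ and $y_0 = x_0$.
2: for $k = 0, 1, \ldots$ until convergence do

3: $x_{k+1} = y_k - \frac{1}{L} f'(y_k)$

4: $y_{k+1} = x_{k+1} + \frac{1 - \sqrt{\rho}}{1 + \sqrt{\rho}} (x_{k+1} - x_k)$

5: end for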
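The two update lines of Algorithm 1 translate almost directly into code. The following is a minimal sketch in plain Python, not the authors' MATLAB implementation; the function name `nesterov_constant_step`, the diagonal quadratic test problem, and the fixed iteration count are assumptions made for illustration. It also evaluates the bound $L (1 - \sqrt{\rho})^k \|x_0 - x^*\|_2^2$ on this problem, where $x^* = 0$ and $f^* = 0$.

```python
import math

def nesterov_constant_step(grad, x0, mu, L, iters):
    """Constant step scheme: x_{k+1} = y_k - grad(y_k)/L, then extrapolate to y_{k+1}."""
    rho = mu / L
    accel = (1.0 - math.sqrt(rho)) / (1.0 + math.sqrt(rho))   # acceleration parameter
    x = list(x0)
    y = list(x0)
    history = [list(x)]
    for _ in range(iters):
        g = grad(y)
        x_next = [yi - gi / L for yi, gi in zip(y, g)]
        y = [xn + accel * (xn - xi) for xn, xi in zip(x_next, x)]
        x = x_next
        history.append(list(x))
    return history

# Hypothetical test problem: f(x) = 0.5 * sum(d_i * x_i^2), so x* = 0 and f* = 0.
d = [1.0, 10.0, 100.0]
mu, L = min(d), max(d)

def f(x):
    return 0.5 * sum(di * xi * xi for di, xi in zip(d, x))

def grad(x):
    return [di * xi for di, xi in zip(d, x)]

x0 = [1.0, 1.0, 1.0]
hist = nesterov_constant_step(grad, x0, mu, L, iters=200)
rate = 1.0 - math.sqrt(mu / L)

def bound(k):
    """L * (1 - sqrt(rho))^k * ||x0 - x*||^2 with x* = 0."""
    return L * rate ** k * sum(xi * xi for xi in x0)

print(f(hist[-1]), bound(200))
```

Each iteration uses exactly one gradient evaluation, which is the cost baseline against which the modified method is compared later.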



Let $\phi_0^* = f(x_0)$; then Lemmas 2.3 and 2.4 characterize the convergence of $\mathcal{N}_{\mu,L}$.
The following theorem is a simplified version of Theorem 2.2.3 of Nesterov [6].

Theorem 2.6. $\mathcal{N}_{\mu,L}$ (Algorithm 1) generates a sequence $\{x_k\}$ such that

$$f(x_k) - f^* \le (1 - \sqrt{\rho})^k \left( f(x_0) + \frac{\mu}{2} \|x_0 - x^*\|_2^2 - f^* \right) \le L (1 - \sqrt{\rho})^k \|x_0 - x^*\|_2^2. \tag{2.12}$$



Note that Nesterov actually provides three variants in [6] and what we mentioned
here is the third one. For the other two, $\{\alpha_k\}$ is not a constant sequence but is
deterministic and has $\alpha_k \to \sqrt{\rho}$ as $k \to \infty$; hence the asymptotic convergence rate
is still $1 - \sqrt{\rho}$. In practice, they perform quite similarly, while the third is the least
expensive among the three variants.



3. Accelerating Nesterov's method with adaptive $\alpha_k$. In $\mathcal{N}_{\mu,L}$, the rate
of decrease of $f(x_k) - f^*$ at the $k$-th iteration is bounded by $1 - \alpha_k$, where $\alpha_k = \sqrt{\rho}$
for all $k$. Our modified method is based on the following idea: try to make $\alpha_k$
larger than $\sqrt{\rho}$ at each iteration in order to accelerate the convergence. To see how it
works, we need to revisit Nesterov's construction, particularly the inequality (2.10).
Given (2.10), it is sufficient to find $\alpha_k$, $y_k$, and $x_{k+1}$ such that

$$f(x_{k+1}) \le f(y_k) - \frac{\alpha_k^2}{2\mu} \|f'(y_k)\|_2^2 + (1 - \alpha_k) \langle f'(y_k), \alpha_k (v_k - y_k) + (x_k - y_k) \rangle + \frac{\mu (1 - \alpha_k)}{2} \left( \alpha_k \|v_k - y_k\|_2^2 + \|x_k - y_k\|_2^2 \right)$$

to retain (2.4): $f(x_{k+1}) \le \phi_{k+1}^*$. The goal of finding an $\alpha_k \in [0, 1]$ as large as possible
leads us to the following optimization problem:

$$\begin{aligned}
\text{maximize} \quad & \alpha_k \in [0, 1] \\
\text{subject to} \quad & f(x_{k+1}) \le f(y_k) - \frac{\alpha_k^2}{2\mu} \|f'(y_k)\|_2^2 + (1 - \alpha_k) \langle f'(y_k), \alpha_k (v_k - y_k) + (x_k - y_k) \rangle \\
& \qquad\qquad + \frac{\mu (1 - \alpha_k)}{2} \left( \alpha_k \|v_k - y_k\|_2^2 + \|x_k - y_k\|_2^2 \right),
\end{aligned} \tag{3.1}$$

where $\alpha_k$, $y_k$, and $x_{k+1}$ are free variables, while $v_k$ is determined at step $k$. Apparently,
one optimal solution is given by $\alpha_k^* = 1$ and $x_{k+1}^* = y_k^* = x^*$. However, $x^*$ is unknown
and $y_k$ and $x_{k+1}$ should be derived from past iterates and gradients. So we can
only expect a sub-optimal solution that is good and easy to obtain. To restrict the
optimization problem, we fix the choices of $y_k$ and $x_{k+1}$, following Nesterov:

$$y_k = \frac{x_k + \alpha_k v_k}{1 + \alpha_k}, \qquad x_{k+1} = y_k - \frac{1}{L} f'(y_k). \tag{3.2}$$

The choice of $y_k$ eliminates the linear term associated with $f'(y_k)$ and we have

$$\frac{\mu (1 - \alpha_k)}{2} \left( \alpha_k \|v_k - y_k\|_2^2 + \|x_k - y_k\|_2^2 \right) = \frac{\mu \alpha_k (1 - \alpha_k)}{2 (1 + \alpha_k)} \|x_k - v_k\|_2^2.$$
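The identity for the quadratic term follows from two differences that the choice $y_k = (x_k + \alpha_k v_k)/(1 + \alpha_k)$ makes explicit. A short worked derivation:

$$v_k - y_k = \frac{v_k - x_k}{1 + \alpha_k}, \qquad x_k - y_k = \frac{\alpha_k (x_k - v_k)}{1 + \alpha_k},$$

$$\alpha_k \|v_k - y_k\|_2^2 + \|x_k - y_k\|_2^2 = \frac{\alpha_k + \alpha_k^2}{(1 + \alpha_k)^2} \|x_k - v_k\|_2^2 = \frac{\alpha_k}{1 + \alpha_k} \|x_k - v_k\|_2^2.$$

The same two differences give $\alpha_k (v_k - y_k) + (x_k - y_k) = 0$, which is why the linear term disappears.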



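Plugging (3.2) into (3.1), we get

$$\begin{aligned}
\text{maximize} \quad & \alpha_k \in [0, 1] \\
\text{subject to} \quad & f(x_{k+1}) \le f(y_k) - \frac{\alpha_k^2}{2\mu} \|f'(y_k)\|_2^2 + \frac{\mu \alpha_k (1 - \alpha_k)}{2 (1 + \alpha_k)} \|x_k - v_k\|_2^2.
\end{aligned} \tag{3.3}$$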



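Since evaluating the function costs time, it is better to eliminate $f(x_{k+1})$ and
$f(y_k)$ from the above inequality. Note that $f'(x)$ is Lipschitz continuous and hence
the choice of $x_{k+1}$ implies (2.11). Strengthening the inequality (3.3) by (2.11), that is, requiring
the right-hand side of (2.11) to be no larger than the right-hand side of (3.3) and multiplying by $2\mu$, we get

$$\begin{aligned}
\text{maximize} \quad & \alpha_k \\
\text{subject to} \quad & (\alpha_k^2 - \rho) \left\| f'\!\left( \frac{x_k + \alpha_k v_k}{1 + \alpha_k} \right) \right\|_2^2 \le \mu^2 \, \frac{\alpha_k (1 - \alpha_k)}{1 + \alpha_k} \|x_k - v_k\|_2^2,
\end{aligned} \tag{3.4}$$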



where $\alpha_k \in [0, 1]$ is implied by the constraint. Now $\alpha_k$ is the only free variable. The
constraint always holds if $\alpha_k = \sqrt{\rho}$. Moreover, the constraint is not tight at $\alpha_k = \sqrt{\rho}$
if $x_k \ne v_k$, which is generally the case. So we can almost always expect an $\alpha_k > \sqrt{\rho}$
at each iteration. However, the problem is still nonlinear and solving it may require
many calls to the gradient function, which is inefficient because those same
gradient calls would let $\mathcal{N}_{\mu,L}$ advance by an equal number of iterations. We try to
solve this problem approximately with the hope of getting $\alpha_k$ as large as possible in
one or two gradient calls. The idea is inspired by the following lemma.

Lemma 3.1. Given $f(x) \in \mathcal{S}_{\mu,L}$, let the pair of sequences $\{\phi_k(x) = \phi_k^* + \frac{\mu}{2} \|x - v_k\|_2^2\}$
and $\{\lambda_k\}$ be as defined in Lemmas 2.4 and 2.5. If for some sequence $\{x_k\}$ we
have $f(x_k) \le \phi_k^*$ for all $k$, then $\lim_{k \to \infty} v_k = x^*$.

Proof. The pair of sequences $\{\phi_k(x)\}$ and $\{\lambda_k\}$ is an estimate sequence. By
definition we have

$$\phi_k(x^*) \le (1 - \lambda_k) f(x^*) + \lambda_k \phi_0(x^*), \quad \forall k \ge 0,$$

and $\lim_{k \to \infty} \lambda_k = 0$. Given $f(x^*) \le f(x_k) \le \phi_k^*$, we know

$$\frac{\mu}{2} \|v_k - x^*\|_2^2 = \phi_k(x^*) - \phi_k^* \le (1 - \lambda_k) f(x^*) + \lambda_k \phi_0(x^*) - f(x^*) = \lambda_k (\phi_0(x^*) - f(x^*)).$$

Letting $k \to \infty$ on both sides, we have $\lim_{k \to \infty} \|v_k - x^*\|_2^2 = 0$ and hence $\lim_{k \to \infty} v_k = x^*$. □

Fig. 3.1. Choosing $\alpha_k$: 1) $\alpha_k = \alpha_0 = \sqrt{\rho}$, we always have $\eta_k(\alpha_0) \le 0$; 2) $\alpha_k = \gamma_k$, the positive
root of $\eta_k(\alpha)$, is an aggressive choice because we don't always have $\|f'(y_{k-1})\|_2 \ge \|f'(y_k)\|_2$;
3) $\alpha_k = \beta_k$, the local minimum of $\eta_k(\alpha)$, is a generally safe choice if $\beta_k > \alpha_0$.



As long as $y_k$ is chosen as in (3.2), by Lemmas 2.3 and 3.1 we have $\lim_{k \to \infty} y_k = x^*$
and thus the global trend for $\|f'(y_k)\|_2$ is decreasing. So, assuming the change
between two consecutive iterations is small, we can use $\|f'(y_{k-1})\|_2$ as an approximate
upper bound on $\|f'(y_k)\|_2$ to save the cost of evaluating gradients, since $\|f'(y_{k-1})\|_2$
is already calculated in the previous step. The modified constraint is therefore

$$(\alpha_k^2 - \rho) \|f'(y_{k-1})\|_2^2 \le \mu^2 \, \frac{\alpha_k (1 - \alpha_k)}{1 + \alpha_k} \|x_k - v_k\|_2^2, \tag{3.5}$$

which is equivalent to

$$\alpha_k^3 + (1 + D_k) \alpha_k^2 - (\rho + D_k) \alpha_k - \rho \le 0,$$

where $D_k = \mu^2 \|x_k - v_k\|_2^2 / \|f'(y_{k-1})\|_2^2$. Let's consider how to pick an $\alpha_k$ at each step.
Define

$$\eta_k(\alpha) = \alpha^3 + (1 + D_k) \alpha^2 - (\rho + D_k) \alpha - \rho.$$
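The equivalence between (3.5) and the cubic inequality can be checked by clearing denominators, assuming $\|f'(y_{k-1})\|_2 \ne 0$ and $\alpha_k \ge 0$ so that $1 + \alpha_k > 0$. Dividing (3.5) by $\|f'(y_{k-1})\|_2^2$ and multiplying by $1 + \alpha_k$ gives

$$(\alpha_k^2 - \rho)(1 + \alpha_k) \le D_k \, \alpha_k (1 - \alpha_k) \quad \Longleftrightarrow \quad \alpha_k^3 + \alpha_k^2 - \rho \alpha_k - \rho \le D_k \alpha_k - D_k \alpha_k^2,$$

and moving the right-hand side to the left yields $\alpha_k^3 + (1 + D_k) \alpha_k^2 - (\rho + D_k) \alpha_k - \rho \le 0$.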

If $D_k = 0$ ($x_k = v_k$), then the largest $\alpha$ satisfying $\eta_k(\alpha) \le 0$ is $\sqrt{\rho}$. Assume that
$D_k > 0$ and $\rho < 1$. It is easy to verify the following properties of $\eta_k(\alpha)$ by checking
its first and second derivatives:

• $\eta_k(\alpha_0) < 0$, where $\alpha_0 = \sqrt{\rho}$,

• $\eta_k(\alpha)$ has exactly one positive local minimum, denoted by $\beta_k$,

• $\eta_k(\alpha)$ has exactly one positive root, denoted by $\gamma_k$.
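The three properties above can be checked numerically. The sketch below is an illustration, not the authors' code: it computes $\beta_k$ in closed form as the positive root of $\eta_k'(\alpha) = 3\alpha^2 + 2(1 + D_k)\alpha - (\rho + D_k)$ and finds $\gamma_k$ by bisection on $[\sqrt{\rho}, 1]$, using $\eta_k(\sqrt{\rho}) = D_k(\rho - \sqrt{\rho}) < 0$ and $\eta_k(1) = 2(1 - \rho) > 0$. The function names, the use of bisection, and the sample values of `rho` and `D` are assumptions.

```python
import math

def eta(a, rho, D):
    """Cubic eta_k(a) = a^3 + (1 + D) a^2 - (rho + D) a - rho."""
    return a ** 3 + (1.0 + D) * a ** 2 - (rho + D) * a - rho

def beta(rho, D):
    """Positive stationary point of eta_k: root of 3a^2 + 2(1 + D)a - (rho + D) = 0."""
    return (-(1.0 + D) + math.sqrt((1.0 + D) ** 2 + 3.0 * (rho + D))) / 3.0

def gamma(rho, D, tol=1e-14):
    """Positive root of eta_k by bisection on [sqrt(rho), 1], where eta_k changes sign."""
    lo, hi = math.sqrt(rho), 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if eta(mid, rho, D) <= 0.0:
            lo = mid
        else:
            hi = mid
    return lo

rho, D = 0.01, 0.5     # hypothetical values with D > 0 and rho < 1
a0, b, g = math.sqrt(rho), beta(rho, D), gamma(rho, D)
print(a0, b, g)
```

Any root finder for the cubic would serve equally well; bisection is used here only because the bracket $[\sqrt{\rho}, 1]$ is known in advance.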



Figure 3.1 shows a typical plot of $\eta_k(\alpha)$ with $\alpha_0$, $\beta_k$, and $\gamma_k$. Note that $\beta_k$ is not
necessarily larger than $\alpha_0$. Choosing $\alpha_k = \alpha_0$ always leads to a valid estimate sequence
that guarantees convergence. Given $\alpha_0$ as our fallback choice, we try to be more
aggressive. $\alpha_k = \gamma_k$ is apparently the most aggressive choice. However, if we choose
$\alpha_k = \gamma_k$, (3.4) may break frequently because $\|f'(y_{k-1})\|_2$ is not always an upper
bound on $\|f'(y_k)\|_2$. If $\beta_k > \alpha_0$, $\alpha_k = \beta_k$ may be a safe choice that is more robust
to the violation of $\|f'(y_{k-1})\|_2 \ge \|f'(y_k)\|_2$. Based on these observations, we propose
four heuristics (from conservative to aggressive) to pick an $\alpha_k$ and compare their
performance later in section 5. They are as follows:

1. $\alpha_k = \max(\alpha_0, \beta_k)$,

2. $\alpha_k = \frac{1}{2} (\alpha_0 + \gamma_k)$,

3. $\alpha_k = \frac{1}{2} (\max(\alpha_0, \beta_k) + \gamma_k)$,

4. $\alpha_k = \gamma_k$.
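The four heuristics above can be sketched as a single helper that returns all candidate values for given $\rho$ and $D_k$. This is an illustrative sketch under the same assumptions as the text ($D_k > 0$, $\rho < 1$); the name `candidate_alphas` and the bisection used to locate $\gamma_k$ are not from the paper.

```python
import math

def candidate_alphas(rho, D):
    """Return the four heuristic choices of alpha_k, in the order listed in the text."""
    a0 = math.sqrt(rho)

    def eta(a):
        return a ** 3 + (1.0 + D) * a ** 2 - (rho + D) * a - rho

    # beta_k: positive root of eta'(a) = 3a^2 + 2(1 + D)a - (rho + D)
    b = (-(1.0 + D) + math.sqrt((1.0 + D) ** 2 + 3.0 * (rho + D))) / 3.0
    # gamma_k: positive root of eta, bracketed by [a0, 1] and found by bisection
    lo, hi = a0, 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if eta(mid) <= 0.0:
            lo = mid
        else:
            hi = mid
    g = lo
    safe = max(a0, b)
    return [safe, 0.5 * (a0 + g), 0.5 * (safe + g), g]

alphas = candidate_alphas(rho=0.01, D=0.5)   # hypothetical inputs
print(alphas)
```

Every candidate lies in $[\alpha_0, \gamma_k]$, so each one satisfies the approximate constraint (3.5); whether it also satisfies (3.4) still has to be checked with the new gradient.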



As mentioned before, having an $\alpha_k$ satisfying constraint (3.5) doesn't imply that $\alpha_k$
is feasible in (3.4). If our guess doesn't meet the constraint, we fall back to Nesterov's
choice $\alpha_k = \sqrt{\rho}$ without making extra effort in searching for an $\alpha_k > \sqrt{\rho}$. Therefore,
the modified method calls the gradient function at most twice per iteration and has
at least the same rate of convergence as $\mathcal{N}_{\mu,L}$ in terms of number of iterations.

We summarize our modified method in Algorithm 2 and refer to it as $\mathcal{N}^{\alpha}_{\mu,L}$. To
differentiate the four heuristics we proposed to pick an $\alpha_k$, we call the corresponding
variants $\mathcal{N}^{\alpha,1}_{\mu,L}$, $\mathcal{N}^{\alpha,2}_{\mu,L}$, $\mathcal{N}^{\alpha,3}_{\mu,L}$, and $\mathcal{N}^{\alpha,4}_{\mu,L}$, respectively.

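Algorithm 2 $\mathcal{N}^{\alpha}_{\mu,L}$, Nesterov's constant step scheme with adaptive $\alpha_k$

1: Given $f(x) \in \mathcal{S}_{\mu,L}(\mathbb{R}^n)$ and $x_0$, set $v_0 = y_0 = x_0$ and $\alpha_0 = \sqrt{\rho} = \sqrt{\mu/L}$.

2: Compute $x_1 = y_0 - \frac{1}{L} f'(y_0)$.

3: for $k = 1, 2, \ldots$ until convergence do

4: Compute $v_k = (1 - \alpha_{k-1}) v_{k-1} + \alpha_{k-1} y_{k-1} - \frac{\alpha_{k-1}}{\mu} f'(y_{k-1})$.

5: Let $D_k = \mu^2 \|x_k - v_k\|_2^2 / \|f'(y_{k-1})\|_2^2$. Choose an $\hat{\alpha}_k \ge \sqrt{\rho}$ such that
$\hat{\alpha}_k^3 + (1 + D_k) \hat{\alpha}_k^2 - (\rho + D_k) \hat{\alpha}_k - \rho \le 0$.

6: Compute $\hat{y}_k = (x_k + \hat{\alpha}_k v_k)/(1 + \hat{\alpha}_k)$ and $\hat{x}_{k+1} = \hat{y}_k - \frac{1}{L} f'(\hat{y}_k)$.

7: Validate whether we have $f(\hat{x}_{k+1}) \le \phi_{k+1}^*$ by verifying a more stringent inequality
$(\hat{\alpha}_k^2 - \rho) \|f'(\hat{y}_k)\|_2^2 \le \mu^2 \, \frac{\hat{\alpha}_k (1 - \hat{\alpha}_k)}{1 + \hat{\alpha}_k} \|x_k - v_k\|_2^2$.

8: If valid, let $\alpha_k = \hat{\alpha}_k$, $y_k = \hat{y}_k$, and $x_{k+1} = \hat{x}_{k+1}$.
Otherwise, let $\alpha_k = \sqrt{\rho}$, $y_k = (x_k + \alpha_k v_k)/(1 + \alpha_k)$, and $x_{k+1} = y_k - \frac{1}{L} f'(y_k)$.

9: end for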
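The steps of Algorithm 2 can be sketched in plain Python as follows. This is an illustrative reading of the algorithm, not the authors' MATLAB code: the function name `adaptive_nesterov`, the `pick` callback that selects the trial value from $(\alpha_0, \beta_k, \gamma_k)$, the bisection for $\gamma_k$, the early exit on a zero gradient, and the diagonal quadratic test problem are all assumptions. The sketch also counts gradient calls, since that is the cost measure used in the experiments.

```python
import math

def adaptive_nesterov(grad, x0, mu, L, iters, pick):
    """Sketch of Algorithm 2. `pick(a0, beta, gamma)` returns the trial alpha_k.
    Returns the final iterate and the total number of gradient calls."""
    rho = mu / L
    a0 = math.sqrt(rho)

    def dot(u, w):
        return sum(ui * wi for ui, wi in zip(u, w))

    def step(y, g):                      # x_{k+1} = y_k - f'(y_k) / L
        return [yi - gi / L for yi, gi in zip(y, g)]

    def mix(x, v, a):                    # y_k = (x_k + a v_k) / (1 + a)
        return [(xi + a * vi) / (1.0 + a) for xi, vi in zip(x, v)]

    v, y, a = list(x0), list(x0), a0
    g = grad(y)
    calls = 1
    x = step(y, g)
    for _ in range(iters):
        # v_k from the previous alpha, y and gradient, as in (2.8)
        v = [(1.0 - a) * vi + a * yi - a / mu * gi for vi, yi, gi in zip(v, y, g)]
        xv = [xi - vi for xi, vi in zip(x, v)]
        dist2, gprev2 = dot(xv, xv), dot(g, g)
        if gprev2 == 0.0:
            break                        # y_{k-1} is already stationary
        D = mu * mu * dist2 / gprev2

        def eta(t):
            return t ** 3 + (1.0 + D) * t ** 2 - (rho + D) * t - rho

        beta = (-(1.0 + D) + math.sqrt((1.0 + D) ** 2 + 3.0 * (rho + D))) / 3.0
        lo, hi = a0, 1.0
        for _ in range(60):              # bisection for the positive root gamma_k
            mid = 0.5 * (lo + hi)
            if eta(mid) <= 0.0:
                lo = mid
            else:
                hi = mid
        a_try = pick(a0, beta, lo)
        y_try = mix(x, v, a_try)
        g_try = grad(y_try)
        calls += 1
        # validation with the new gradient: the more stringent inequality (3.4)
        lhs = (a_try ** 2 - rho) * dot(g_try, g_try)
        rhs = mu * mu * a_try * (1.0 - a_try) / (1.0 + a_try) * dist2
        if lhs <= rhs:
            a, y, g = a_try, y_try, g_try
        else:                            # fall back to Nesterov's alpha = sqrt(rho)
            a = a0
            y = mix(x, v, a)
            g = grad(y)
            calls += 1
        x = step(y, g)
    return x, calls

# Hypothetical test problem: f(x) = 0.5 * sum(d_i * x_i^2), with mu = 1 and L = 100.
d = [1.0, 10.0, 100.0]

def f(x):
    return 0.5 * sum(di * xi * xi for di, xi in zip(d, x))

def grad(x):
    return [di * xi for di, xi in zip(d, x)]

x_final, calls = adaptive_nesterov(grad, [1.0, 1.0, 1.0], mu=1.0, L=100.0, iters=200,
                                   pick=lambda a0, beta, gamma: max(a0, beta))
print(f(x_final), calls)
```

The `pick` argument shown corresponds to the first heuristic; passing `lambda a0, beta, gamma: gamma` would give the fourth. Each pass through the loop makes one gradient call when the trial value is accepted and two when it is rejected.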



4. Related work. In this section, we discuss related work on accelerating Nesterov's
methods $\mathcal{N}_L$ and $\mathcal{N}_{\mu,L}$. In both $\mathcal{N}_L$ and $\mathcal{N}_{\mu,L}$, the global Lipschitz constant
$L$ is assumed to be known. However, $L$ might be difficult to get, and even if $L$ is
given, local Lipschitz constants may be much smaller than $L$, so that the step size
$1/L$ becomes too conservative. A widely adopted solution is backtracking linesearch,
where the step size is adaptively chosen. Tseng [10] presented a sufficient condition
on the step size to preserve the convergence rate of $\mathcal{N}_L$. Becker et al. [1, §5.3]
proposed an alternative condition that is numerically more stable to verify, and they
also discussed implementation issues. Gonzaga and Karas [3] developed a linesearch
scheme that preserves the convergence rate of $\mathcal{N}_{\mu,L}$ when only $\mu$ is given. Linesearch
schemes generally do not need explicit knowledge of $L$, but a single search may require
evaluating the objective function several times. Hence, even if $L$ is provided, it
is still problem-dependent whether we should use the constant step $\mathcal{N}_L$/$\mathcal{N}_{\mu,L}$ or a
backtracking linesearch.

On strongly convex functions with Lipschitz gradient, $\mathcal{N}_L$ may converge at a
rate $O(1/k^2)$ while even the steepest gradient descent method has linear convergence.
Note that the optimal method $\mathcal{N}_{\mu,L}$ takes the same form as $\mathcal{N}_L$. The only difference
is the acceleration parameter. $\mathcal{N}_L$ increases the acceleration parameter gradually.
$\mathcal{N}_{\mu,L}$, given the global convexity parameter $\mu$, sets the acceleration parameter to
a constant that guarantees linear convergence at an optimal rate. However, $\mu$ is
not always known. Nesterov [5] proposed a practical approach to discover strong
convexity: restarting $\mathcal{N}_L$ after a certain number of iterations. Theoretically, whether
we should restart $\mathcal{N}_L$ depends on the local condition number. Empirically, even with
sub-optimal choices, linear convergence rate can be achieved. See Becker et al. [1,
§5.6] for more details. Gonzaga and Karas [3] developed an adaptive procedure to
estimate $\mu$ at the cost of function evaluations.

In this work, we assume that both $\mu$ and $L$ are given and only the gradient
function is used to maintain minimal cost per iteration. We save gradient calls based
on the global trend of $\|f'(y_k)\|_2$. We argue that there are many cases where $\mu$ and $L$
are easy to obtain. $L$ can be easily estimated for a quadratic function, or derived from
a smooth approximation of a non-smooth function [7], and $\mu$ can be derived from a
quadratic regularization term, e.g., $\frac{\mu}{2} \|x - c\|_2^2$, or by adding a quadratic term to the
objective manually and then performing sequential updates.

5. Numerical experiments. We compare the four variants of $\mathcal{N}^{\alpha}_{\mu,L}$ with $\mathcal{N}_L$
and $\mathcal{N}_{\mu,L}$. We implement $\mathcal{N}^{\alpha}_{\mu,L}$ in MATLAB. The source code is available for
download¹ together with code that can be used to reproduce our results. $\mathcal{N}_L$ doesn't take
$\mu$ as input and converges with rate $O(1/k^2)$. To recover linear convergence, as
suggested by Nesterov [8] and Becker et al. [1], we restart $\mathcal{N}_L$ after a certain number of
iterations. The optimal number of iterations between restarts is problem-dependent.
For each test, we restart $\mathcal{N}_L$ every 10, 100, and 1000 iterations respectively, compare
the convergence rates with $\mathcal{N}_L$ without restart, and present the best result. The
experiments were performed on a laptop with a dual-core Intel Core Duo CPU at
2.0 GHz and 4 GB of RAM. Only one core was used to remove the effect of
multi-threading. We compare the convergence based on number of gradient calls and
on running times, rather than on number of iterations, because Nesterov's methods
call the gradient function exactly once per iteration, but $\mathcal{N}^{\alpha}_{\mu,L}$ may call the gradient
function twice per iteration. The running times were measured in wall-clock time.

5.1. Ridge regression. Our first test is on a ridge regression problem, i.e., a
linear least squares problem with Tikhonov regularization:

$$\text{minimize} \quad f(x) = \frac{1}{2} \|Ax - b\|_2^2 + \frac{\lambda}{2} \|x\|_2^2,$$

where $A \in \mathbb{R}^{m \times n}$ is the measurement matrix, $b \in \mathbb{R}^m$ is the response vector, and $\lambda > 0$
is the ridge parameter. The unique solution is given by $x^* = (A^T A + \lambda I)^{-1} A^T b$.

$f(x)$ is a positive definite quadratic function, the simplest function type in the
$\mathcal{S}_{\mu,L}$ family. $f(x)$ has Lipschitz gradient with constant $L = \|A\|_2^2 + \lambda$ and strong
convexity with parameter $\mu = \lambda$. It is easy to show that $\mathcal{N}_{\mu,L}$ automatically achieves
a better convergence rate on positive definite quadratic functions by exploring the
eigenspace. We have

$$\|x_k - x^*\|_2 \le C_0 (1 - \sqrt{\rho})^k \|x_0 - x^*\|_2$$

for some constant $C_0 > 0$ and hence

$$f(x_k) - f^* \le \frac{L}{2} \|x_k - x^*\|_2^2 \le \frac{C_0^2 L}{2} (1 - \sqrt{\rho})^{2k} \|x_0 - x^*\|_2^2.$$

We omit the proof because it is purely mechanical. Another important fact about
positive definite quadratic functions is that there exist algorithms that can achieve
the lower complexity bound derived by Nesterov [6, p. 68], e.g., the conjugate
gradient (CG) method. We refer readers to Luenberger [4] for a detailed analysis of CG's
convergence rate. For least squares problems, LSQR [9] is preferable because LSQR
is equivalent to applying CG to the normal equation in exact arithmetic but is
numerically more stable. The purpose of this test is not to compete with LSQR, which is
specifically designed to solve least squares problems, but to treat LSQR as an ideal
method and see how $\mathcal{N}^{\alpha}_{\mu,L}$ can reduce the gap between $\mathcal{N}_{\mu,L}$ and the ideal method on
the simplest function family in $\mathcal{S}_{\mu,L}$.

¹ http://www.stanford.edu/~mengxr/pub/acc_nesterov.html

Fig. 5.1. On a ridge regression problem. Top left: $f - f^*$ vs. number of gradient calls. Top
right: $f - f^*$ vs. running time. Bottom left: $\|x - x^*\|_2$ vs. number of gradient calls. Bottom right:
$\|x - x^*\|_2$ vs. running time. In terms of convergence speed, LSQR is fastest, followed by the first
three variants of $\mathcal{N}^{\alpha}_{\mu,L}$, then $\mathcal{N}_{\mu,L}$, then $\mathcal{N}^{\alpha,4}_{\mu,L}$, then $\mathcal{N}_L$. $\mathcal{N}^{\alpha,4}_{\mu,L}$ is too aggressive and should be used with caution.

We choose $m = 1200$, $n = 2000$, and $\lambda = 1.0$. We generate $A = U \Sigma V^T$, where
$U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times m}$ are matrices with orthonormal columns chosen at random, and $\Sigma \in \mathbb{R}^{m \times m}$
is a diagonal matrix with diagonal elements linearly spaced between and including
100 and 1. $b = \mathrm{randn}(m, 1)$ is a random vector whose entries are i.i.d. samples drawn
from the standard normal distribution. Although the exact value is known, $\|A\|_2^2$ is
estimated by applying the power method to $A A^T$. We have $\mu = 1$ and $L \approx 10001$.
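For readers unfamiliar with the power method, the following minimal sketch shows how $\|A\|_2^2$, the largest eigenvalue of $A A^T$, can be estimated using only products with $A$ and $A^T$. It is an illustration in plain Python on a hypothetical $2 \times 2$ matrix, not the MATLAB code used in the experiments; the function name, starting vector, and iteration count are assumptions.

```python
import math

def power_method_sq_norm(A, iters=200):
    """Estimate ||A||_2^2, the largest eigenvalue of A A^T, by power iteration."""
    m, n = len(A), len(A[0])
    u = [1.0 / math.sqrt(m)] * m                      # deterministic unit starting vector
    est = 0.0
    for _ in range(iters):
        w = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]   # w = A^T u
        z = [sum(A[i][j] * w[j] for j in range(n)) for i in range(m)]   # z = A A^T u
        est = math.sqrt(sum(zi * zi for zi in z))     # ||A A^T u||, tends to the top eigenvalue
        u = [zi / est for zi in z]
    return est

A = [[1.0, 2.0], [3.0, 4.0]]     # hypothetical small matrix
lam = 1.0                        # ridge parameter, as in the experiment
L_est = power_method_sq_norm(A) + lam
print(L_est)
```

For a unit vector $u$, $\|A A^T u\|_2$ never exceeds the largest eigenvalue, so the estimate approaches $\|A\|_2^2$ from below; a small safety margin on $L$ may be worth considering when the iteration is stopped early.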



Figure 5.1 shows the comparison results. LSQR leads as expected. The first three variants
of $\mathcal{N}^{\alpha}_{\mu,L}$ form the second group, with one of them having a slight edge. $\mathcal{N}^{\alpha,4}_{\mu,L}$ falls behind all
other variants of $\mathcal{N}^{\alpha}_{\mu,L}$ and $\mathcal{N}_{\mu,L}$ because it is too aggressive in choosing $\alpha_k$ and
falls back to $\alpha_k = \sqrt{\rho}$ frequently. Hence $\mathcal{N}^{\alpha,4}_{\mu,L}$ should be used with caution. $\mathcal{N}_L$, even
with restart, is the slowest among the competing methods. We see that $\mathcal{N}^{\alpha}_{\mu,L}$, with one of the
first three heuristics, reduces the gap between $\mathcal{N}_{\mu,L}$ and LSQR by approximately 30% in terms
of number of gradient calls.

Fig. 5.2. On an anisotropic bowl. Top left: $f - f^*$ vs. number of gradient calls. Top right:
$f - f^*$ vs. running time. Bottom left: $\|x - x^*\|_2$ vs. number of gradient calls. Bottom right: $\|x - x^*\|_2$
vs. running time. All the variants of $\mathcal{N}^{\alpha}_{\mu,L}$ converge significantly faster than $\mathcal{N}_{\mu,L}$ or $\mathcal{N}_L$.

5.2. Anisotropic bowl. The second test is on a bowl-shaped function, which is
anisotropic along different directions:

$$\begin{aligned}
\text{minimize} \quad & f(x) = \sum_{i=1}^{n} i \cdot x_{(i)}^4 + \frac{1}{2} \|x\|_2^2 \\
\text{subject to} \quad & \|x\|_2 \le \tau,
\end{aligned}$$

where we use $x_{(i)}$ to indicate the $i$-th element of $x$. We add the constraint to make $f(x)$
have a Lipschitz continuous gradient over the feasible region. If $x_k$ falls outside the
feasible region, we project it back to the nearest feasible point. By doing so, we know
the function value will be decreased, so the convergence result still holds. We use this
example to test the performance of $\mathcal{N}^{\alpha}_{\mu,L}$ and competing methods when the gradient
has local Lipschitz constants that are much smaller than the global one.

We choose $n = 500$, $\tau = 4$, and $x_0 = \frac{\tau}{\sqrt{n}} \mathbf{1}$. With these choices, we have $L =
12 n \tau^2 + 1 = 96001$ and $\mu = 1$. Figure 5.2 draws the convergence results. We see that
all the variants of $\mathcal{N}^{\alpha}_{\mu,L}$ converge significantly faster than $\mathcal{N}_{\mu,L}$ or $\mathcal{N}_L$. For example,
to reach $f(x_k) - f^* < 10^{-12}$, variants of $\mathcal{N}^{\alpha}_{\mu,L}$ take about 200 gradient calls, $\mathcal{N}_{\mu,L}$
takes 5500 gradient calls, and $\mathcal{N}_L$ takes 7000 gradient calls. The differences among
the four variants of $\mathcal{N}^{\alpha}_{\mu,L}$ are really small.

Fig. 5.3. Projected trajectory of $\{x_k\}$ on the plane spanned by $x_{(1)}$ and $x_{(n)}$. $\mathcal{N}_{\mu,L}$ and $\mathcal{N}^{\alpha}_{\mu,L}$
are almost following the same path, though $\mathcal{N}_{\mu,L}$ makes little progress per step, while $\mathcal{N}^{\alpha}_{\mu,L}$ has many
large steps. We say $\mathcal{N}^{\alpha}_{\mu,L}$ accelerates $\mathcal{N}_{\mu,L}$ in this sense.
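The ingredients of this test problem can be sketched in a few lines. The code below assumes the objective has the form $\sum_i i \cdot x_{(i)}^4 + \frac{1}{2}\|x\|_2^2$, which is consistent with the stated constants $L = 12 n \tau^2 + 1 = 96001$ and $\mu = 1$; the function names and the plain-Python list representation are assumptions made for illustration.

```python
import math

n, tau = 500, 4.0

def f(x):
    """f(x) = sum_i i * x_i^4 + 0.5 * ||x||^2, with 1-based index i."""
    return sum((i + 1) * xi ** 4 for i, xi in enumerate(x)) + 0.5 * sum(xi * xi for xi in x)

def grad(x):
    """Gradient of f: 4 * i * x_i^3 + x_i in each coordinate."""
    return [4.0 * (i + 1) * xi ** 3 + xi for i, xi in enumerate(x)]

def project(x):
    """Euclidean projection onto the ball ||x||_2 <= tau."""
    nrm = math.sqrt(sum(xi * xi for xi in x))
    return list(x) if nrm <= tau else [tau * xi / nrm for xi in x]

L = 12.0 * n * tau ** 2 + 1.0     # largest Hessian entry on the ball, attained at x = tau * e_n
mu = 1.0                          # contributed by the 0.5 * ||x||^2 term
x0 = [tau / math.sqrt(n)] * n     # starting point on the boundary of the ball
print(L, mu, f(x0))
```

The Hessian is diagonal with entries $12\, i\, x_{(i)}^2 + 1$, so near the minimizer at the origin the local Lipschitz constant is close to 1, far below the global bound of 96001. This is the regime the test is designed to probe.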

To investigate further, we plot the projected trajectory of $\{x_k\}$ on the plane
spanned by $x_{(1)}$ and $x_{(n)}$ for each method. In Figure 5.3 we see that the point
sequences generated by $\mathcal{N}^{\alpha}_{\mu,L}$ and $\mathcal{N}_{\mu,L}$ are almost following the same path. However,
$\mathcal{N}_{\mu,L}$ makes very little progress per step, while $\mathcal{N}^{\alpha}_{\mu,L}$ jumps along the path. In this
sense we say $\mathcal{N}^{\alpha}_{\mu,L}$ is indeed accelerating $\mathcal{N}_{\mu,L}$.

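5.3. Smooth-BPDN. The third test is on a smoothed and strongly convex version
of the basis pursuit denoising (BPDN) problem of Chen et al. [2]:

$$\text{minimize} \quad f(x) = \frac{1}{2} \|Ax - b\|_2^2 + \lambda \|x\|_{\ell_1,\tau} + \frac{\rho}{2} \|x\|_2^2,$$

where $\|\cdot\|_{\ell_1,\tau}$ is given by

$$\|x\|_{\ell_1,\tau} = \begin{cases} |x| - \frac{\tau}{2} & \text{if } |x| \ge \tau, \\ \frac{x^2}{2\tau} & \text{if } |x| < \tau, \end{cases}$$

if $x$ is a scalar and $\|x\|_{\ell_1,\tau} = \sum_{i=1}^{n} \|x_{(i)}\|_{\ell_1,\tau}$ if $x$ is a vector in $\mathbb{R}^n$. $\|\cdot\|_{\ell_1,\tau}$ is a
smoothed version of the $\ell_1$ norm, also recognized as the Huber penalty function with
half-width $\tau$. $\lambda > 0$ and $\rho > 0$ are parameters controlling the penalty terms (in this
subsection $\rho$ denotes the regularization weight, not the reciprocal condition number). The
quadratic term $\frac{\rho}{2} \|x\|_2^2$ makes the function strongly convex. $f(x)$ has Lipschitz gradient
with constant $L = \|A\|_2^2 + \frac{\lambda}{\tau} + \rho$ and strong convexity with parameter $\mu = \rho$.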

We set $A = \frac{1}{\sqrt{n}} \cdot \mathrm{randn}(m, n)$, where $m = 800$ and $n = 2000$, $\lambda = 0.05$, $\tau = 0.0001$,
and $\rho = 0.05$. The true signal is a random sparse vector with 40 nonzeros. $b = Ax^* + e$,
where $e = 0.01\, c \cdot \mathrm{randn}(m, 1)$ is Gaussian noise ($c$ is a normalizing scale factor). $\|A\|_2$ is estimated by applying
the power method to $A A^T$. The value is around 1.63. Hence we have

$$L = \|A\|_2^2 + \frac{\lambda}{\tau} + \rho \approx 502.7 \quad \text{and} \quad \mu = 0.05.$$
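The smoothed $\ell_1$ term and the resulting constants can be checked with a short sketch. It assumes the Huber form with half-width $\tau$ described above and reuses the numerical values quoted in the text; the function names and the variable `rho_reg` (the weight of the quadratic term) are hypothetical.

```python
def huber(t, tau):
    """Smoothed absolute value with half-width tau."""
    return abs(t) - 0.5 * tau if abs(t) >= tau else t * t / (2.0 * tau)

def huber_grad(t, tau):
    """Derivative of huber(t, tau): t / tau clipped to [-1, 1], Lipschitz with constant 1 / tau."""
    return max(-1.0, min(1.0, t / tau))

lam, tau, rho_reg = 0.05, 1.0e-4, 0.05   # values quoted in the text
norm_A = 1.63                            # estimated ||A||_2 quoted in the text
L = norm_A ** 2 + lam / tau + rho_reg    # Lipschitz constant of the gradient
mu = rho_reg                             # strong convexity parameter
print(round(L, 1), mu)
```

The term $\lambda / \tau = 500$ dominates $L$, so the condition number of this problem is driven almost entirely by how sharply the $\ell_1$ norm is smoothed.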

There is no analytic solution for this problem. We apply $\mathcal{N}_{\mu,L}$ to the problem with a
small tolerance on the gradient norm and use the approximate solution returned by
$\mathcal{N}_{\mu,L}$ as the optimal solution. Figure 5.4 presents the results. All variants of $\mathcal{N}^{\alpha}_{\mu,L}$
run faster than $\mathcal{N}_{\mu,L}$ or $\mathcal{N}_L$. It takes about 750 gradient calls for $\mathcal{N}^{\alpha}_{\mu,L}$ to reach
$f(x_k) - f^* < 10^{-12}$, 1300 for $\mathcal{N}_{\mu,L}$, and 1900 for $\mathcal{N}_L$. The corresponding running
times are around 5, 7.5, and 11.5 seconds, respectively. $\mathcal{N}^{\alpha,4}_{\mu,L}$ is slow at the beginning
but becomes the fastest method at the end. However, the differences among the four
variants of $\mathcal{N}^{\alpha}_{\mu,L}$ are not big.

Fig. 5.4. Smooth-BPDN. Top left: $f - f^*$ vs. number of gradient calls. Top right: $f - f^*$ vs.
running time. Bottom left: $\|x - x^*\|_2$ vs. number of gradient calls. Bottom right: $\|x - x^*\|_2$ vs.
running time. All the variants of $\mathcal{N}^{\alpha}_{\mu,L}$ converge faster than $\mathcal{N}_{\mu,L}$ or $\mathcal{N}_L$.

Though the purpose of this test is not to recover sparse signals but to compare
$\mathcal{N}^{\alpha}_{\mu,L}$ with competing methods, we show that smooth-BPDN does recover sparse
signals and hence it has practical value as well. Figure 5.5 compares the smooth-BPDN
solution with the exact signal. We see the smooth-BPDN solution is very
similar to a soft-thresholded version of the exact signal. It recovers all the coefficients
with large magnitude.

In summary, the proposed method can effectively accelerate Nesterov's
method $\mathcal{N}_{\mu,L}$ in all the tests we present. Among the four variants, the first, second,
and third perform quite similarly. The fourth, the most aggressive one, may
fall back frequently, as we see in the ridge regression case. Though it is the fastest
method in the smooth-BPDN test, we don't recommend it in general. Since the first
heuristic is the most conservative one and delivers comparable performance in all
three tests, we suggest using $\mathcal{N}^{\alpha,1}_{\mu,L}$ as the default setting.



Fig. 5.5. Exact signal vs. smooth-BPDN solution. The smooth-BPDN solution is very close to
a soft-thresholded version of the exact signal. The smooth-BPDN solution recovers all the coefficients
with large magnitude.

6. Conclusion and future work. We modified Nesterov's constant step gradient
method for strongly convex functions with Lipschitz gradient such that, at each
iteration, we try to choose an $\alpha_k > \sqrt{\rho}$ adaptively while preserving the estimate
sequence, where $\alpha_k$ controls the rate of decrease. $\mathcal{N}^{\alpha}_{\mu,L}$, the modified method, has at
least the same convergence speed as Nesterov's method. Though it may evaluate the
gradient function twice per iteration, in practice it effectively accelerates the speed of
convergence for many problems. We propose four heuristics for choosing $\alpha_k$, compare
their performance in the numerical experiments, and suggest a default one to use.

Note that we don't utilize all the degrees of freedom in constructing our method.
The sequences $\{y_k\}$ and $\{x_k\}$ are still following Nesterov's, so that we can reduce the
number of calls to the gradient function. However, further exploration of the choices
of $\{y_k\}$, $\{x_k\}$, and $\{\alpha_k\}$ may help discover more efficient methods or help design
variable step size methods. We leave those possible directions as our future work.

The authors would like to thank Michael A. Saunders for useful comments on a 
previous draft of this paper. 



REFERENCES 



[1] S. R. Becker, E. J. Candès, and M. C. Grant, Templates for convex cone problems with
applications to sparse signal recovery, Math. Prog. Comp., 3 (2011), pp. 165-218.
[2] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit,
SIAM J. Sci. Comput., 20 (1998), pp. 33-61.
[3] C. C. Gonzaga and E. W. Karas, Fine tuning Nesterov's steepest descent algorithm for
differentiable convex programming, tech. report, Federal University of Paraná, Brazil, 2008.
[4] D. G. Luenberger, Introduction to Linear and Nonlinear Programming, Addison-Wesley,
1973.
[5] Y. Nesterov, A method of solving a convex programming problem with convergence rate
O(1/k^2), Soviet Math. Dokl., 27 (1983), pp. 372-376.
[6] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Springer, 2003.
[7] Y. Nesterov, Smooth minimization of non-smooth functions, Math. Program., 103 (2005),
pp. 127-152.
[8] Y. Nesterov, Gradient methods for minimizing composite objective function, tech. report, Center for
Operations Research and Econometrics (CORE), Université Catholique de Louvain, 2007.
[9] C. C. Paige and M. A. Saunders, LSQR: An algorithm for sparse linear equations and sparse
least squares, ACM Trans. Math. Softw., 8 (1982), pp. 43-71.
[10] P. Tseng, On accelerated proximal gradient methods for convex-concave optimization,
submitted to SIAM J. Optim., (2008).






